WEEK 4

Analysis

Boston Data

The Boston data set, on housing values in the suburbs of Boston, contains 506 observations of 14 variables, of which 12 are continuous numerical variables and 2 are discrete integers. It includes several variables describing factors that might influence housing values, listed below.

- crim: per capita crime rate by town.

- zn: proportion of residential land zoned for lots over 25,000 sq.ft.

- indus: proportion of non-retail business acres per town.

- chas: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).

- nox: nitrogen oxides concentration (parts per 10 million).

- rm: average number of rooms per dwelling.

- age: proportion of owner-occupied units built prior to 1940.

- dis: weighted mean of distances to five Boston employment centres.

- rad: index of accessibility to radial highways.

- tax: full-value property-tax rate per $10,000.

- ptratio: pupil-teacher ratio by town.

- black: 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town.

- lstat: lower status of the population (percent).

- medv: median value of owner-occupied homes in $1000s.

# load the packages and the data (Boston is in the MASS package)
library(MASS)
library(dplyr)
data("Boston")
## check how the data looks (structure, dimensions, etc.)
glimpse(Boston)
## Observations: 506
## Variables: 14
## $ crim    <dbl> 0.00632, 0.02731, 0.02729, 0.03237, 0.06905, 0.02985, ...
## $ zn      <dbl> 18.0, 0.0, 0.0, 0.0, 0.0, 0.0, 12.5, 12.5, 12.5, 12.5,...
## $ indus   <dbl> 2.31, 7.07, 7.07, 2.18, 2.18, 2.18, 7.87, 7.87, 7.87, ...
## $ chas    <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ nox     <dbl> 0.538, 0.469, 0.469, 0.458, 0.458, 0.458, 0.524, 0.524...
## $ rm      <dbl> 6.575, 6.421, 7.185, 6.998, 7.147, 6.430, 6.012, 6.172...
## $ age     <dbl> 65.2, 78.9, 61.1, 45.8, 54.2, 58.7, 66.6, 96.1, 100.0,...
## $ dis     <dbl> 4.0900, 4.9671, 4.9671, 6.0622, 6.0622, 6.0622, 5.5605...
## $ rad     <int> 1, 2, 2, 3, 3, 3, 5, 5, 5, 5, 5, 5, 5, 4, 4, 4, 4, 4, ...
## $ tax     <dbl> 296, 242, 242, 222, 222, 222, 311, 311, 311, 311, 311,...
## $ ptratio <dbl> 15.3, 17.8, 17.8, 18.7, 18.7, 18.7, 15.2, 15.2, 15.2, ...
## $ black   <dbl> 396.90, 396.90, 392.83, 394.63, 396.90, 394.12, 395.60...
## $ lstat   <dbl> 4.98, 9.14, 4.03, 2.94, 5.33, 5.21, 12.43, 19.15, 29.9...
## $ medv    <dbl> 24.0, 21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 27.1, 16.5, ...

To explore the variables and their relationships further, I created plots of their distributions and correlations.

# Plots (ggpairs comes from the GGally package)
library(GGally)
p <- ggpairs(Boston, mapping = aes(alpha = 0.5), lower = list(combo = wrap("facethist", bins = 20)))
p1 <- p + ggtitle("Housing values in suburbs of Boston")
p1

Figure 1. Boston data variables

# correlation plot (corrplot package)
library(corrplot)
cor_matrix <- cor(Boston) %>% round(2)
corrplot(cor_matrix, method = "circle", type = "upper", cl.pos = "b", tl.pos = "d", tl.cex = 0.6)

Figure 2. Correlations

The distribution plots in Figure 1 show that almost none of the variables follow a normal distribution. Only one variable, the average number of rooms per dwelling (rm), appears approximately normal. The other variables are either right skewed (crim, zn, chas, nox, dis, lstat and medv), left skewed (age, ptratio and black) or clearly bimodal (indus, rad and tax). There also seem to be strong correlations between many of the variables. The correlations are plotted in Figure 2, where red indicates negative and blue positive correlation, and the size and color intensity of each circle indicate the strength of the correlation: larger and darker means stronger. For example, the lower status of the population (lstat) and the median value of owner-occupied homes (medv) have a strong negative correlation, whereas the index of accessibility to radial highways (rad) and the full-value property-tax rate per $10,000 (tax) have a strong positive correlation. Whether these observations are statistically significant can be confirmed with statistical tests.
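The strongest pairwise relationships mentioned above can also be checked numerically; a minimal sketch (assuming the MASS package is installed):

```r
# quick numeric check of the correlations discussed above
library(MASS)  # provides the Boston data set
data("Boston")

# lstat vs medv: strong negative correlation (about -0.74)
cor(Boston$lstat, Boston$medv)
# rad vs tax: strong positive correlation (about 0.91)
cor(Boston$rad, Boston$tax)
# statistical significance can be checked with cor.test, e.g.
cor.test(Boston$lstat, Boston$medv)$p.value
```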

## summary of the data
summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08204   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00

The skewness of the variables can also be seen in the summary statistics above. For crim, the interval between the minimum 0.00632 and the maximum 88.98 is wide, yet the median is only 0.256. The results are difficult to compare because the variables are on different scales, so the data should be standardized (below). After standardization each variable has mean 0 and standard deviation 1, which makes the variables comparable with each other and easier to interpret.
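Standardization here means subtracting the column mean and dividing by the standard deviation, i.e. z = (x - mean(x)) / sd(x). A minimal sketch verifying that scale() does exactly this for one variable:

```r
library(MASS)
data("Boston")

# manual standardization of one column
manual <- (Boston$crim - mean(Boston$crim)) / sd(Boston$crim)

# scale() applies the same transformation to every column
scaled <- as.data.frame(scale(Boston))

# the two approaches agree
all.equal(manual, scaled$crim)
```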

# standardize data and print out the summary
boston_scaled <- as.data.frame(scale(Boston))
summary(boston_scaled)
##       crim                 zn               indus        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202  
##       chas              nox                rm               age         
##  Min.   :-0.2723   Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331  
##  1st Qu.:-0.2723   1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366  
##  Median :-0.2723   Median :-0.1441   Median :-0.1084   Median : 0.3171  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.:-0.2723   3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059  
##  Max.   : 3.6648   Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164  
##       dis               rad               tax             ptratio       
##  Min.   :-1.2658   Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047  
##  1st Qu.:-0.8049   1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876  
##  Median :-0.2790   Median :-0.5225   Median :-0.4642   Median : 0.2746  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6617   3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058  
##  Max.   : 3.9566   Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372  
##      black             lstat              medv        
##  Min.   :-3.9033   Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.: 0.2049   1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median : 0.3808   Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.4332   3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 0.4406   Max.   : 3.5453   Max.   : 2.9865

Linear Discriminant Analysis (LDA)

Linear discriminant analysis (LDA) is a classification method that can be used to find the variables that best separate classes, to predict classes for new data, and to reduce dimensionality. The target variable can be binary or multiclass. LDA assumes that the variables are normally distributed and that each class shares the same covariance matrix. To perform LDA with the crime rate as the target, a categorical crime class is first created from the scaled crim variable and the data are then divided into training and test sets.
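Because the rows for the training set are drawn at random, the exact numbers below will vary between runs. A hedged sketch of a reproducible 80/20 split (the seed value 123 is an arbitrary choice, not part of the original analysis):

```r
library(MASS)
data("Boston")
boston_scaled <- as.data.frame(scale(Boston))

set.seed(123)  # fix the random number generator so the split is reproducible
n <- nrow(boston_scaled)
# choose randomly 80% of the rows (size is truncated to a whole number)
ind <- sample(n, size = n * 0.8)
train <- boston_scaled[ind, ]
test  <- boston_scaled[-ind, ]

c(train = nrow(train), test = nrow(test))  # 404 training rows, 102 test rows
```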

# save the scaled crime rate as a vector
scaled_crim <- boston_scaled$crim

# summary of the scaled_crim
summary(scaled_crim)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -0.419400 -0.410600 -0.390300  0.000000  0.007389  9.924000
# create a quantile vector of crim to use as break points
bins <- quantile(scaled_crim)

# create a categorical variable 'crime'
crime <- cut(scaled_crim, breaks = bins, include.lowest = TRUE, labels = c("low", "med_low", "med_high", "high"))
# remove original crim from the dataset
boston_scaled <- dplyr::select(boston_scaled, -crim)
# add the new categorical value to scaled data
boston_scaled <- data.frame(boston_scaled, crime)
# Divide dataset to test and train sets
n <- nrow(boston_scaled)
# choose randomly 80% of the rows
ind <- sample(n,  size = n * 0.8)
#  train set
train <- boston_scaled[ind,]
# and test set 
test <- boston_scaled[-ind,]

# linear discriminant analysis on the train set 
lda.fit <- lda(crime ~., data = train)
lda.fit
## Call:
## lda(crime ~ ., data = train)
## 
## Prior probabilities of groups:
##       low   med_low  med_high      high 
## 0.2599010 0.2574257 0.2450495 0.2376238 
## 
## Group means:
##                  zn      indus         chas        nox         rm
## low       0.9409768 -0.9015249 -0.084848104 -0.8736702  0.4247167
## med_low  -0.1135082 -0.2576328 -0.007331936 -0.5449535 -0.1088236
## med_high -0.3729012  0.1737975  0.165126514  0.3953306  0.1167668
## high     -0.4872402  1.0172418 -0.108283225  1.0659835 -0.4271663
##                 age        dis        rad        tax     ptratio
## low      -0.8550002  0.8598096 -0.6996764 -0.7314427 -0.44708494
## med_low  -0.3206475  0.2754719 -0.5478832 -0.4510342 -0.05940589
## med_high  0.3819466 -0.3929399 -0.4227186 -0.3212120 -0.32052420
## high      0.8183463 -0.8653103  1.6368728  1.5131579  0.77931510
##                black       lstat         medv
## low       0.38593802 -0.75884629  0.516783100
## med_low   0.31131803 -0.11247059  0.009815111
## med_high  0.07455151  0.01044852  0.211476508
## high     -0.90674677  0.93780690 -0.709518409
## 
## Coefficients of linear discriminants:
##                 LD1           LD2         LD3
## zn       0.10421979  0.6963638900 -0.87623981
## indus    0.03977506 -0.2537820201  0.29710233
## chas    -0.04075613 -0.0047725687  0.07528764
## nox      0.44539291 -0.8724281106 -1.41063077
## rm       0.04033224 -0.0648843955 -0.10293019
## age      0.22330544 -0.1891520043 -0.18449982
## dis     -0.06621316 -0.2482051644 -0.03634122
## rad      3.41785854  0.9510175825 -0.21895872
## tax      0.00065747  0.0008216975  0.68946457
## ptratio  0.13003811 -0.0045906986 -0.25369643
## black   -0.11288449  0.0338764238  0.15261219
## lstat    0.23258159 -0.2542925956  0.42029270
## medv     0.12818572 -0.5198363264 -0.16823018
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.9552 0.0334 0.0113
# Plot
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}
# target classes as numeric
classes <- as.numeric(train$crime)
# plot the lda results
plot(lda.fit, dimen = 2, col=classes, pch=classes, main="LDA")
lda.arrows(lda.fit, myscale = 2)
## Warning in arrows(x0 = 0, y0 = 0, x1 = myscale * heads[, choices[1]], y1 =
## myscale * : zero-length arrow is of indeterminate angle and so skipped

Figure 3. LDA

Figure 3 shows the LDA results: the colors represent the different crime classes and the arrows show the impact of each predictor variable. The first discriminant (LD1) alone accounts for about 95.5% of the between-class variance (proportion of trace above), and accessibility to radial highways (rad) appears to be strongly associated with high crime rates.
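The dominance of rad can also be read directly off the discriminant coefficients. A sketch refitting the model on the full scaled data, so the result does not depend on the random train/test split (this refit is an illustration, not the model used for prediction below):

```r
library(MASS)
data("Boston")
boston_scaled <- as.data.frame(scale(Boston))

# recreate the quantile-based crime classes as above
bins <- quantile(boston_scaled$crim)
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE,
             labels = c("low", "med_low", "med_high", "high"))
boston_scaled$crime <- crime
boston_scaled$crim <- NULL

# fit LDA on the full data and inspect absolute LD1 coefficients
fit <- lda(crime ~ ., data = boston_scaled)
sort(abs(fit$scaling[, "LD1"]), decreasing = TRUE)  # rad comes out on top
```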

Classes can be predicted with the LDA model on the test data, and this information can be used to assess the performance of the model. The correct classes are cross tabulated against the predicted classes.

# save test crime
correct_classes <- test$crime
## test set for prediction
test <- dplyr::select(test, -crime) 
# predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)

# and cross tabulate the results
table(correct = correct_classes, predicted = lda.pred$class) 
##           predicted
## correct    low med_low med_high high
##   low       15       6        1    0
##   med_low    4      16        2    0
##   med_high   1      11       13    2
##   high       0       0        0   31

The model performs best on the high class: all 31 predictions are correct. Low is predicted correctly 15 times out of 22, with 6 observations misclassified as med_low and 1 as med_high. Med_low is predicted correctly 16 times out of 22. Med_high is the weakest class, correct only 13 times out of 27, with 11 observations misclassified as med_low. The model thus performs well for all classes except med_high.
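Overall accuracy can be computed from the cross tabulation as the share of correct predictions on the diagonal; a sketch with the numbers copied by hand from the table above:

```r
# confusion matrix from the cross tabulation above (rows = correct, cols = predicted)
classes <- c("low", "med_low", "med_high", "high")
conf <- matrix(c(15,  6,  1,  0,
                  4, 16,  2,  0,
                  1, 11, 13,  2,
                  0,  0,  0, 31),
               nrow = 4, byrow = TRUE,
               dimnames = list(correct = classes, predicted = classes))

# share of predictions on the diagonal
accuracy <- sum(diag(conf)) / sum(conf)
round(accuracy, 3)  # about 0.735
```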

Distance and K-means clustering

K-means is a clustering method that groups objects based on their similarity. The number of clusters that best fits the data can be determined, for example, by examining how the total within-cluster sum of squares (WCSS) behaves as the number of clusters changes. The optimal number can be read from Figure 4 at the point where the WCSS drops sharply, which here happens at 2 clusters.

data("Boston")
#scale
boston_scaled <- as.data.frame(scale(Boston))
# euclidean distance matrix
dist_eu <- dist(boston_scaled)
# look at the summary of the distances
summary(dist_eu)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1343  3.4620  4.8240  4.9110  6.1860 14.4000
k_max <- 10
# calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(dist_eu, k)$tot.withinss})
# visualize the results to find optimal number of clusters
plot(1:k_max, twcss, type='b')

Figure 4. Total WCSS by number of clusters

# Plot 
# k-means clustering
km <- kmeans(dist_eu, centers = 2)
# plot the Boston dataset with clusters
pairs(Boston, col = km$cluster)

Figure 5. Variables by two cluster

In Figure 5, the variables are plotted in pairs and colored by cluster. The difference between the clusters is visible in almost all panels. There is a clear difference, for example, in the crime rate and the accessibility to radial highways, which was also seen in the LDA plot.
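The difference between the two clusters can also be quantified by comparing cluster means. A minimal sketch; note that here k-means is run on the scaled data itself rather than on the distance matrix, and the seed is set because k-means starts from random centers:

```r
library(MASS)
data("Boston")
boston_scaled <- as.data.frame(scale(Boston))

set.seed(123)  # k-means initialization is random
km <- kmeans(boston_scaled, centers = 2)

# mean of each (scaled) variable within each cluster
cluster_means <- aggregate(boston_scaled, by = list(cluster = km$cluster), FUN = mean)
round(cluster_means, 2)
```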